Loan Approval Prediction

Executive Summary

This project builds supervised machine learning models to predict loan approval based on borrower features.

Key Steps:
• Data preprocessing and missing value handling
• Categorical encoding and scaling
• Class imbalance handling using SMOTE
• Model comparison (Logistic Regression & Random Forest)

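Results:
• Random Forest achieved 83% accuracy on the 123-sample test set (Logistic Regression: 81%).
• ROC-AUC score: 0.79, computed from hard class predictions rather than predicted probabilities.
• SMOTE balanced the training classes at 337 samples each; no model was trained without SMOTE, so its effect on recall is not measured here.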

Business Insight: The Random Forest model is recommended for deployment, with the decision threshold adjusted to reduce risky approvals.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

from imblearn.over_sampling import SMOTE
In [5]:
df = pd.read_csv("C:\\Users\\DeLL\\OneDrive\\Desktop\\Alfido Tech Internship\\loan_prediction.csv")
df.head()
Out[5]:
    Loan_ID  Gender  Married  Dependents     Education  Self_Employed  ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History  Property_Area  Loan_Status
0  LP001002    Male       No           0      Graduate             No             5849                0.0         NaN             360.0             1.0          Urban            Y
1  LP001003    Male      Yes           1      Graduate             No             4583             1508.0       128.0             360.0             1.0          Rural            N
2  LP001005    Male      Yes           0      Graduate            Yes             3000                0.0        66.0             360.0             1.0          Urban            Y
3  LP001006    Male      Yes           0  Not Graduate             No             2583             2358.0       120.0             360.0             1.0          Urban            Y
4  LP001008    Male       No           0      Graduate             No             6000                0.0       141.0             360.0             1.0          Urban            Y
In [7]:
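df.shape                          # not displayed: only a cell's last expression is echoed
df.info()                         # prints dtypes and non-null counts per column
df.isnull().sum()                 # not displayed, for the same reason as df.shape
df['Loan_Status'].value_counts()  # echoed as Out[7]: class counts of the target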
<class 'pandas.core.frame.DataFrame'>
Index: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 67.2+ KB
Out[7]:
Loan_Status
Y    422
N    192
Name: count, dtype: int64
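The non-null counts and class counts above are easier to compare as proportions. A minimal sketch, assuming a small hypothetical frame `toy` in place of the real `df` (its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loan data; values are illustrative only.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, 66.0, np.nan],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Class proportions instead of raw counts.
class_share = toy["Loan_Status"].value_counts(normalize=True)

print(missing_share)
print(class_share)
```

Applied to the real data, the same two calls would show that 422 of 614 applications (about 69%) are approved, which is the imbalance SMOTE addresses later.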

Loan_ID is a unique identifier for each application and carries no predictive signal, so it is dropped.

In [10]:
df.drop(columns=['Loan_ID'], inplace=True)
In [14]:
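# Credit_History is stored as 0.0/1.0 but acts as a binary flag,
# so it is imputed with the mode like the other categorical columns.
categorical_cols = ['Gender','Married','Dependents','Self_Employed','Credit_History']

for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])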
In [16]:
numerical_cols = ['LoanAmount','Loan_Amount_Term']

for col in numerical_cols:
    df[col] = df[col].fillna(df[col].median())
In [18]:
df.isnull().sum()
Out[18]:
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
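The imputation above computes modes and medians on the full dataset, before the train/test split. A common alternative is to compute these statistics on the training rows only and reuse them for the test rows, so that no test information enters preprocessing. The following is a sketch of that alternative, not the notebook's method; the `train` and `test` frames are hypothetical stand-ins for the output of `train_test_split`:

```python
import numpy as np
import pandas as pd

# Hypothetical train/test frames; values are illustrative only.
train = pd.DataFrame({
    "LoanAmount": [100.0, 120.0, np.nan, 500.0],
    "Gender": ["Male", "Male", None, "Female"],
})
test = pd.DataFrame({
    "LoanAmount": [np.nan, 90.0],
    "Gender": [None, "Female"],
})

# Fill values are learned from the training rows only.
fill_values = {
    "LoanAmount": train["LoanAmount"].median(),
    "Gender": train["Gender"].mode()[0],
}

# The same training-derived values are applied to both splits.
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(fill_values)
print(test_filled)
```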
In [20]:
df['Loan_Status'] = df['Loan_Status'].map({'Y':1, 'N':0})
In [22]:
df['Loan_Status'].value_counts()
Out[22]:
Loan_Status
1    422
0    192
Name: count, dtype: int64
In [24]:
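# One-hot encode the remaining text columns; drop_first=True drops one level
# per feature so the indicator columns are not redundant with each other.
df = pd.get_dummies(df, drop_first=True)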
In [26]:
df.head()
df.shape
Out[26]:
(614, 15)
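To see how `pd.get_dummies(..., drop_first=True)` arrives at this column count, here is a small sketch on a hypothetical three-row frame that reuses a few of the dataset's column names:

```python
import pandas as pd

# Hypothetical mini-frame; only the column names mirror the loan data.
toy = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000],
    "Education": ["Graduate", "Graduate", "Not Graduate"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
})

# Numeric columns pass through unchanged; each text column with k levels
# becomes k - 1 indicator columns (the first level in sorted order is dropped).
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))
```

In the notebook, the 6 text columns left after dropping Loan_ID expand into 9 indicator columns, which together with the 6 numeric columns (including the mapped Loan_Status) give the 15 columns reported above.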
In [28]:
X = df.drop('Loan_Status', axis=1)
y = df['Loan_Status']
In [30]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
In [32]:
X_train.shape
X_test.shape
Out[32]:
(123, 14)
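The `stratify=y` argument keeps the approved/rejected ratio nearly identical in both splits. A minimal sketch of that effect, assuming synthetic labels with the same class counts as the notebook (422 approved, 192 rejected) and a placeholder feature column:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the notebook's class counts; the feature is a placeholder.
y = np.array([1] * 422 + [0] * 192)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Share of approved loans overall, in the training split, and in the test split.
print(round(y.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
print(len(y_tr), len(y_te))
```

This is consistent with the test-set support shown in the classification reports further down (38 rejected and 85 approved out of 123).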
In [34]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
In [36]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)

X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
In [38]:
pd.Series(y_train_smote).value_counts()
Out[38]:
Loan_Status
1    337
0    337
Name: count, dtype: int64
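SMOTE reaches this 337/337 balance by creating synthetic minority rows, each placed on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below illustrates that interpolation idea only; it is a simplified stand-in, not imbalanced-learn's implementation, and the function name `smote_like_samples` and the data are hypothetical:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    # Ask for k + 1 neighbours because each point is returned as its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)         # random minority points
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]  # one of their k neighbours
    lam = rng.random((n_new, 1))                           # interpolation weight in [0, 1)
    return X_min[base] + lam * (X_min[neigh] - X_min[base])


# Hypothetical minority-class rows (already scaled), illustrative only.
rng = np.random.default_rng(0)
X_minority = rng.normal(size=(20, 2))

X_new = smote_like_samples(X_minority, n_new=10)
print(X_new.shape)
```

In the notebook's training split (491 rows, 337 of them approved), balancing the classes means adding 183 synthetic rejected-loan rows to the 154 real ones.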
In [40]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_smote, y_train_smote)

y_pred_lr = lr.predict(X_test)
In [42]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_smote, y_train_smote)

y_pred_rf = rf.predict(X_test)
In [44]:
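from sklearn.metrics import classification_report, roc_auc_score

# Note: roc_auc_score receives hard 0/1 predictions here, so the value printed
# below equals balanced accuracy (the mean of the two class recalls).
# Passing predict_proba(X_test)[:, 1] instead would give a threshold-independent ROC-AUC.
print("Logistic Regression Report")
print(classification_report(y_test, y_pred_lr))
print("ROC-AUC:", roc_auc_score(y_test, y_pred_lr))

print("\nRandom Forest Report")
print(classification_report(y_test, y_pred_rf))
print("ROC-AUC:", roc_auc_score(y_test, y_pred_rf))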
Logistic Regression Report
              precision    recall  f1-score   support

           0       0.70      0.68      0.69        38
           1       0.86      0.87      0.87        85

    accuracy                           0.81       123
   macro avg       0.78      0.78      0.78       123
weighted avg       0.81      0.81      0.81       123

ROC-AUC: 0.7773993808049535

Random Forest Report
              precision    recall  f1-score   support

           0       0.74      0.68      0.71        38
           1       0.86      0.89      0.88        85

    accuracy                           0.83       123
   macro avg       0.80      0.79      0.80       123
weighted avg       0.83      0.83      0.83       123

ROC-AUC: 0.7891640866873064
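The ROC-AUC values above are computed from hard class predictions. For a binary problem, that reduces ROC-AUC to balanced accuracy, whereas passing predicted probabilities evaluates the ranking across all thresholds. A minimal sketch of the difference, assuming synthetic data from `make_classification` in place of the loan features (the numbers it prints are not the notebook's results):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data standing in for the loan features.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.3, 0.7], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
y_pred = model.predict(X_te)               # hard 0/1 labels
y_score = model.predict_proba(X_te)[:, 1]  # probability of class 1

auc_from_labels = roc_auc_score(y_te, y_pred)
auc_from_scores = roc_auc_score(y_te, y_score)
bal_acc = balanced_accuracy_score(y_te, y_pred)

print("ROC-AUC from labels:       ", auc_from_labels)
print("Balanced accuracy:         ", bal_acc)
print("ROC-AUC from probabilities:", auc_from_scores)
```

The first two printed values coincide; the third is the figure usually meant by ROC-AUC, and for the notebook's models it could be obtained with `roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])`.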
In [46]:
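# Predicted probability of class 1 (approved) from the Random Forest
y_prob = rf.predict_proba(X_test)[:,1]

# Approve only when that probability is at least 0.6, instead of the default 0.5
custom_threshold = 0.6
y_custom = (y_prob >= custom_threshold).astype(int)

print(classification_report(y_test, y_custom))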
              precision    recall  f1-score   support

           0       0.63      0.71      0.67        38
           1       0.86      0.81      0.84        85

    accuracy                           0.78       123
   macro avg       0.75      0.76      0.75       123
weighted avg       0.79      0.78      0.78       123
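A single threshold shows only one point on the trade-off. Sweeping several thresholds makes it visible how approvals shrink and recall for rejected loans grows as the cut-off rises. A minimal sketch, assuming synthetic data from `make_classification` rather than the loan data, so the printed numbers are not the notebook's results:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; class 1 plays the role of "approved".
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.3, 0.7], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
y_prob = model.predict_proba(X_te)[:, 1]

rows = []
for threshold in (0.4, 0.5, 0.6, 0.7):
    y_hat = (y_prob >= threshold).astype(int)
    rows.append((
        threshold,
        int(y_hat.sum()),                                            # number of approvals
        recall_score(y_te, y_hat, pos_label=0),                      # rejected loans caught
        precision_score(y_te, y_hat, pos_label=1, zero_division=0),  # approvals that were correct
    ))

for threshold, approvals, recall_rejected, precision_approved in rows:
    print(f"{threshold:.1f}  approvals={approvals:3d}  "
          f"recall(0)={recall_rejected:.2f}  precision(1)={precision_approved:.2f}")
```

Raising the threshold can never increase the number of approvals or lower recall for class 0, so the choice comes down to how many good applicants the business is willing to turn away.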

Business Interpretation

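Random Forest performed slightly better than Logistic Regression, with higher accuracy (0.83 vs. 0.81) and ROC-AUC (0.79 vs. 0.78). Both ROC-AUC values were computed from hard class predictions, so they equal the mean of the two class recalls.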

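However, recall for rejected loans (class 0) is relatively low at 0.68 for both models: roughly a third of the 38 test applications that were actually rejected are predicted as approved, so some risky applications may still be approved.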
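The number of risky approvals behind that recall figure can be recovered from the Random Forest report. The sketch below uses only the printed support and recall values; because the report rounds recall to two decimals, the counts are inferred, and the last lines cross-check them against the printed "ROC-AUC", which for hard predictions is the mean of the two class recalls:

```python
# Figures taken from the Random Forest classification report.
support_rejected = 38
support_approved = 85
recall_rejected = 0.68   # rounded to two decimals in the report
recall_approved = 0.89   # rounded to two decimals in the report

# Inferred counts of correctly classified applications per class.
correct_rejections = round(recall_rejected * support_rejected)
correct_approvals = round(recall_approved * support_approved)

# Rejected applications that the model would have approved.
false_approvals = support_rejected - correct_rejections

# Cross-check: the mean of the exact recalls should match the printed "ROC-AUC".
mean_recall = (
    correct_rejections / support_rejected + correct_approvals / support_approved
) / 2

print(correct_rejections, false_approvals, correct_approvals)
print(mean_recall)
```

Under these assumptions, 12 of the 38 applications that should have been rejected are approved at the default threshold.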

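For deployment, a higher decision threshold (e.g., 0.6) can be used to reduce financial risk. On this test set, a threshold of 0.6 raised recall for rejected loans from 0.68 to 0.71, while recall for approved loans fell from 0.89 to 0.81 and accuracy from 0.83 to 0.78.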

Conclusion

The Random Forest model achieved the best overall performance, with 83% accuracy and a ROC-AUC of 0.79 (computed from hard class predictions). Both models predicted approved loans well (recall of 0.87 to 0.89) but caught only 68% of rejected loans, so further tuning and threshold adjustment are recommended before deployment.